Advances in Computers, Volume 112 by Unknown

Advances in Computers, Volume 112 by Unknown

Author:Unknown
Language: eng
Format: epub


ARTICLE IN PRESS

Data Warehouse Testing

31

Table 8 Examples of Achilles Data Quality Rules

Rule_id Data Quality Rule

Status

Description

19

Year of birth should not be Warning

Checks whether or not year

prior to 1800

of birth is less than 1800

32

Percentage of patients with Notification Checks whether or not the

no visits should not exceed

percentage of patients that

a threshold value

have no visit records is greater

than 5

need of the health data users. Achilles defines its data quality rules as SQL

queries while PEDSnet uses R. Table 8 shows examples of two data quality

rules that are validated in Achilles.

Note that the tools described in Section 4.1.1 to test the quality of the

underlying data in a data warehousing system can also be used to execute

data quality tests for the ETL process. The difference is that in the context

of ETL testing, the tools have a different purpose, which is to test any time

data is added or modified through the ETL process.

5.1.2 Balancing Tests

Balancing tests ensure that the data obtained from the source databases is not

lost or incorrectly modified by the ETL process. In this testing activity, data

in the source and target data warehouse are analyzed and differences are

reported.

The balancing approach called Sampling [65] uses source-to-target map-

ping documents to extract data from both the source and target tables and

store them in two spreadsheets. Then it uses the Stare and Compare technique

to manually verify data and determine differences through viewing or eye-

balling the data. Since this task can involve the comparison of billions of

records, most of the time, a few number of the entire set of records are ver-

ified through this approach.

IBM QuerySurge [65] is a commercial tool that was built specifically to

automate the balancing tests through query wizards. The tool implements a

method for fast comparison of validation query results written by testers [66].

The query wizards implement an interface to make sure that minimal effort

and no programming skills are required for developing balancing tests and

obtaining results. The tool compares data based on column, table, and record

count properties. Testers select the tables and columns to be compared in the

wizard. The problem with this tool is that it only compares data that is not

ARTICLE IN PRESS

32

Hajar Homayouni et al.

modified during the ETL transformation, which is claimed to be 80% of

data. However, the goal of ETL testing should also be to validate data that

has been reformatted and modified through the ETL process.

Another method is Minus Queries [65] in which the difference between

the source and target is determined by subtracting the target data from the

source data to show existence of unbalanced data. The problem with this

method is the potential for false positives. For example, as many data ware-

houses keep historical data, there may be duplicate records in the target data

warehouse corresponding to the same entity and the result might report an

error based on the differences in number of records in the source and the

target data warehouse. However, these duplications are actually allowed

in the target data warehouse.

We propose to identify discrepancies that may arise between the source

and the target data due to an incorrect transformation process. Based on

these discrepancies we define a set of properties, namely, completeness, con-

sistency, and syntactic validity.



Download



Copyright Disclaimer:
This site does not store any files on its server. We only index and link to content provided by other sites. Please contact the content providers to delete copyright contents if any and email us, we'll remove relevant links or contents immediately.